skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Creators/Authors contains: "Groot_Koerkamp, Ragnar"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract MotivationSampling k-mers is a ubiquitous task in sequence analysis algorithms. Sampling schemes such as the often-used random minimizer scheme are particularly appealing as they guarantee at least one k-mer is selected out of every w consecutive k-mers. Sampling fewer k-mers often leads to an increase in efficiency of downstream methods. Thus, developing schemes that have low density, i.e. have a small proportion of sampled k-mers, is an active area of research. After over a decade of consistent efforts in both decreasing the density of practical schemes and increasing the lower bound on the best possible density, there is still a large gap between the two. ResultsWe prove a near-tight lower bound on the density of forward sampling schemes, a class of schemes that generalizes minimizer schemes. For small w and k, we observe that our bound is tight when k≡1(mod w). For large w and k, the bound can be approximated by 1w+k⌈w+kw⌉. Importantly, our lower bound implies that existing schemes are much closer to achieving optimal density than previously known. For example, with the current default minimap2 HiFi settings w = 19 and k = 19, we show that the best known scheme for these parameters, the double decycling-set-based minimizer of Pellow et al. is at most 3% denser than optimal, compared to the previous gap of at most 50%. Furthermore, when k≡1(mod w) and the alphabet size σ goes to ∞, we show that mod-minimizers introduced by Groot Koerkamp and Pibiri achieve optimal density matching our lower bound. Availability and implementationMinimizer implementations: github.com/RagnarGrootKoerkamp/minimizers ILP and analysis: github.com/treangenlab/sampling-scheme-analysis. 
    more » « less